R Markdown


## [1] "/Users/maxalekhnovich/Downloads"
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## [1] 13

General Distributions

Description

The graph on the left shows the distribution of fixed acidity. The median is approximately 8 (7.90 in the summary above), and there is a significant number of outlying points, particularly above the median. The histogram on the right shows the same distribution in another format.
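One way to check which points count as outliers is Tukey's 1.5 × IQR rule, which is also the default whisker rule in `geom_boxplot()`. The sketch below is only illustrative: `tukey_fences` is a hypothetical helper that is not part of the original analysis, and the quartiles are copied from the `fixed.acidity` column of the summary above.

```r
# Hypothetical helper (not in the original analysis): Tukey fences
# computed from the first and third quartiles.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles of fixed.acidity, copied from the summary output above
fences <- tukey_fences(q1 = 7.10, q3 = 9.20)
fences
# lower is about 3.95, upper is about 12.35
```

Compared against the printed minimum of 4.60 and maximum of 15.90, only the high end crosses a fence under this rule.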

# Similar graph to the one above, but with volatile acidity:
# boxplot with jittered points (left) and histogram (right)
library(ggplot2)
library(gridExtra)

grid.arrange(
  ggplot(data = redInfo, aes(x = 1, y = volatile.acidity)) +
    ylab("Volatile Acidity Levels") +
    ggtitle("Volatile Acidity Distribution") +
    geom_jitter(alpha = 0.1) +
    geom_boxplot(alpha = 0.2, color = "red"),
  ggplot(data = redInfo, aes(x = volatile.acidity)) +
    xlab("Volatile Acidity Levels") +
    geom_histogram(bins = 30),
  ncol = 2)

# Chloride distribution: boxplot with jittered points (left) and histogram (right)
grid.arrange(
  ggplot(data = redInfo, aes(x = 1, y = chlorides)) +
    ylab("Chloride Quantity") +
    ggtitle("Chloride Distribution") +
    geom_jitter(alpha = 0.1) +
    geom_boxplot(alpha = 0.2, color = "blue"),
  # xlim() drops rows outside 0 to 0.25, which triggers the warning below
  ggplot(data = redInfo, aes(x = chlorides)) +
    xlab("Chloride Levels") +
    xlim(0, 0.25) +
    geom_histogram(bins = 30),
  ncol = 2)
## Warning: Removed 25 rows containing non-finite values (stat_bin).
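
The warning above comes from `xlim(0, 0.25)`, which discards rows outside the limits before the histogram is binned. A minimal sketch of that count, using made-up values rather than the real `chlorides` column (`chlorides_demo` and `lims` are hypothetical names):

```r
# Made-up chloride values; the last two fall outside [0, 0.25]
chlorides_demo <- c(0.012, 0.079, 0.090, 0.267, 0.611)
lims <- c(0, 0.25)

# Rows that xlim() would treat as out of range and remove before binning
out_of_range <- chlorides_demo < lims[1] | chlorides_demo > lims[2]
n_removed <- sum(out_of_range)
n_removed
```

If the goal is only to zoom in, `coord_cartesian(xlim = c(0, 0.25))` keeps every row in the statistics and changes just the visible window, although the 30 bins are then computed over the full range of the data.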

Single Variable analysis of volatile acidity and chloride distribution

## NULL

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   2.0   Min.   : 6.500   Min.   :0.1800   Min.   :0.0000  
##  1st Qu.: 380.5   1st Qu.: 8.400   1st Qu.:0.3700   1st Qu.:0.2500  
##  Median : 597.5   Median : 9.900   Median :0.4600   Median :0.4400  
##  Mean   : 691.5   Mean   : 9.914   Mean   :0.4847   Mean   :0.4085  
##  3rd Qu.:1059.2   3rd Qu.:11.200   3rd Qu.:0.5900   3rd Qu.:0.5300  
##  Max.   :1562.0   Max.   :15.900   Max.   :1.2400   Max.   :1.0000  
##  residual.sugar     chlorides      free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.0440   Min.   : 3.00      
##  1st Qu.: 1.900   1st Qu.:0.0730   1st Qu.: 6.00      
##  Median : 2.200   Median :0.0840   Median :12.00      
##  Mean   : 2.699   Mean   :0.1052   Mean   :14.87      
##  3rd Qu.: 2.700   3rd Qu.:0.1000   3rd Qu.:20.00      
##  Max.   :15.500   Max.   :0.6110   Max.   :55.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.4000  
##  1st Qu.: 20.00       1st Qu.:0.9964   1st Qu.:3.090   1st Qu.:0.5600  
##  Median : 38.00       Median :0.9974   Median :3.150   Median :0.6500  
##  Mean   : 48.66       Mean   :0.9976   Mean   :3.125   Mean   :0.7076  
##  3rd Qu.: 66.00       3rd Qu.:0.9988   3rd Qu.:3.180   3rd Qu.:0.7925  
##  Max.   :289.00       Max.   :1.0037   Max.   :3.210   Max.   :2.0000  
##     alcohol        quality     
##  Min.   : 8.4   Min.   :3.000  
##  1st Qu.: 9.4   1st Qu.:5.000  
##  Median : 9.9   Median :6.000  
##  Mean   :10.2   Mean   :5.682  
##  3rd Qu.:10.9   3rd Qu.:6.000  
##  Max.   :14.9   Max.   :8.000
##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   3.0   Min.   : 6.000   Min.   :0.180    Min.   :0.0000  
##  1st Qu.: 429.8   1st Qu.: 7.725   1st Qu.:0.370    1st Qu.:0.2100  
##  Median : 810.5   Median : 8.300   Median :0.500    Median :0.3100  
##  Mean   : 811.2   Mean   : 8.644   Mean   :0.511    Mean   :0.3074  
##  3rd Qu.:1188.5   3rd Qu.: 9.400   3rd Qu.:0.630    3rd Qu.:0.4100  
##  Max.   :1590.0   Max.   :13.000   Max.   :1.070    Max.   :0.7600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.200   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 2.000   1st Qu.:0.07200   1st Qu.: 7.00      
##  Median : 2.200   Median :0.08000   Median :12.50      
##  Mean   : 2.568   Mean   :0.08600   Mean   :15.28      
##  3rd Qu.: 2.775   3rd Qu.:0.09275   3rd Qu.:21.00      
##  Max.   :12.900   Max.   :0.35800   Max.   :66.00      
##  total.sulfur.dioxide    density             pH          sulphates    
##  Min.   :  6.00       Min.   :0.9906   Min.   :3.220   Min.   :0.370  
##  1st Qu.: 21.00       1st Qu.:0.9960   1st Qu.:3.243   1st Qu.:0.550  
##  Median : 38.00       Median :0.9969   Median :3.270   Median :0.620  
##  Mean   : 49.57       Mean   :0.9969   Mean   :3.269   Mean   :0.648  
##  3rd Qu.: 69.00       3rd Qu.:0.9979   3rd Qu.:3.290   3rd Qu.:0.720  
##  Max.   :165.00       Max.   :1.0029   Max.   :3.310   Max.   :1.560  
##     alcohol         quality     
##  Min.   : 9.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.00   Median :6.000  
##  Mean   :10.37   Mean   :5.663  
##  3rd Qu.:11.07   3rd Qu.:6.000  
##  Max.   :13.40   Max.   :8.000
##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1.0   Min.   : 4.600   Min.   :0.12     Min.   :0.0000  
##  1st Qu.: 406.0   1st Qu.: 6.700   1st Qu.:0.43     1st Qu.:0.0300  
##  Median : 892.0   Median : 7.200   Median :0.56     Median :0.1300  
##  Mean   : 853.5   Mean   : 7.284   Mean   :0.56     Mean   :0.1772  
##  3rd Qu.:1285.0   3rd Qu.: 7.800   3rd Qu.:0.66     3rd Qu.:0.3000  
##  Max.   :1599.0   Max.   :11.600   Max.   :1.58     Max.   :0.7800  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.200   Min.   :0.03400   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.06700   1st Qu.: 9.00      
##  Median : 2.100   Median :0.07700   Median :15.00      
##  Mean   : 2.436   Mean   :0.07852   Mean   :16.73      
##  3rd Qu.: 2.500   3rd Qu.:0.08500   3rd Qu.:22.00      
##  Max.   :13.900   Max.   :0.26700   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9902   Min.   :3.320   Min.   :0.3300  
##  1st Qu.: 24.00       1st Qu.:0.9952   1st Qu.:3.360   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9962   Median :3.400   Median :0.6100  
##  Mean   : 43.68       Mean   :0.9962   Mean   :3.434   Mean   :0.6363  
##  3rd Qu.: 58.00       3rd Qu.:0.9973   3rd Qu.:3.480   3rd Qu.:0.7100  
##  Max.   :160.00       Max.   :1.0026   Max.   :4.010   Max.   :1.1600  
##     alcohol         quality     
##  Min.   : 8.70   Min.   :3.000  
##  1st Qu.: 9.70   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.57   Mean   :5.597  
##  3rd Qu.:11.20   3rd Qu.:6.000  
##  Max.   :14.00   Max.   :8.000

Fixed acidity levels for wines of the highest quality fall in a specific range, between 6 and 8. This is not true of wines of lesser quality, which can have a wider range of fixed acidity.

The first graph shows that the majority of wines with a high pH have little to no citric acid, and most have a total sulfur dioxide level under 80. Wines with medium pH are scattered. Wines with low pH tend to have less sulfur dioxide than medium pH wines and also have a higher concentration of citric acid levels from 0.35 to 0.6, judging by how dark the points are at an alpha value of 1/10.

The second graph shows that there is not a significant relationship between pH and the quality of wines, as wines of the highest quality appear in all three graphs. The citric acid quantity for high and medium pH wines is lower than the citric acid level for low pH wines, with few outliers.
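
The three pH groups discussed above can be built with `subset()`. In this sketch the cut points 3.21 and 3.31 are an assumption inferred from the minimum and maximum pH printed in the three summaries, and `wine_demo` is a made-up stand-in for `redInfo`. The names `lowPH`, `mediumPH`, and `largePH` are the ones mentioned in the conclusion.

```r
# Made-up stand-in for redInfo with only the column needed here
wine_demo <- data.frame(pH = c(2.74, 3.10, 3.21, 3.22, 3.31, 3.32, 4.01))

# Cut points inferred from the printed summaries (an assumption)
lowPH    <- subset(wine_demo, pH <= 3.21)
mediumPH <- subset(wine_demo, pH > 3.21 & pH <= 3.31)
largePH  <- subset(wine_demo, pH > 3.31)

# Every row should land in exactly one group
group_sizes <- c(low = nrow(lowPH), medium = nrow(mediumPH), large = nrow(largePH))
group_sizes
```

Checking that the group sizes add up to the number of rows is a quick guard against gaps or overlaps between the conditions.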

The first graph shows that there is a positive relationship between alcohol and residual sugar. In low-quality wines (3 and 4), there is a significant rise in alcohol and residual sugar. The average-quality wines don’t have a significant increase in residual sugar levels, but they do have a more varied level of alcohol. There is a slight increase in alcohol and residual sugar levels for the wines of the highest quality.

# Let's get some stats on the density variable
densitySummary <- summary(redInfo$density)

# Let's get some stats on the total sulfur dioxide variable
totalSulferDSummary <- summary(redInfo$total.sulfur.dioxide)

densitySummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
totalSulferDSummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
# Let's examine density vs total sulfur dioxide, colored by quality
# A blue-to-red gradient makes the colors stand out
plot1 <- ggplot(data = redInfo, aes(x = density,
                                    y = total.sulfur.dioxide,
                                    color = quality)) +
  scale_color_continuous(low = "blue", high = "red") +
  geom_point() +
  # Axis labels now match the variables actually plotted
  labs(title = "Density VS Total Sulfur Dioxide",
       x = "Density Level",
       y = "Total Sulfur Dioxide")

 
# Let's try the same graph with a facet wrap of quality rather than the color
# The color variable now represents the free sulfur dioxide
ggplot(data = redInfo, aes(x = density, y = total.sulfur.dioxide,
                           color = free.sulfur.dioxide)) +
  scale_color_continuous(low = "blue", high = "red") +
  geom_point() +
  facet_wrap(~quality) +
  labs(title = "Density VS Total Sulfur Dioxide",
       subtitle = "Faceted by Quality",
       x = "Density Level",
       y = "Total Sulfur Dioxide",
       color = "Free Sulfur Dioxide")

# Want summaries of citric acid, alcohol, and residual sugar before graphing
citricAcidSummary = summary(redInfo$citric.acid)
alcoholSummary = summary(redInfo$alcohol)
residualSugarSummary = summary(redInfo$residual.sugar)
citricAcidSummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
residualSugarSummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
alcoholSummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
# Let's try a different combination with the same approach:
# citric acid vs residual sugar, with color representing alcohol
ggplot(data = redInfo, aes(x = citric.acid,
                           y = residual.sugar,
                           color = alcohol)) +
  scale_color_continuous(low = "black", high = "yellow") +
  geom_point() +
  facet_wrap(~quality) +
  labs(title = "Citric Acid VS Residual Sugar",
       x = "Citric Acid Level",
       y = "Residual Sugar Level",
       color = "Alcohol")

# The graph demonstrates that, within each quality level, the wines fall in a particular range of both residual sugar and alcohol
# Really poor quality wines also have either extremely low citric acid or a lot of citric acid, based on the graph

The 1st graph illustrates that wines of higher quality tend to have more free sulfur dioxide, based on the legend on the right. Also, wines of the highest quality have a density slightly above or below 0.995 and a total sulfur dioxide level below 100, with the majority being under 50. The same can be said of the lowest-quality wines, but those wines also have a much lower free sulfur dioxide level.

The 2nd graph illustrates the citric acid level vs the residual sugar level of wines of varying qualities. The lowest quality wines mostly have little to no citric acid and a residual sugar level around 2. The average quality wines have a higher level of alcohol and a wider range of residual sugar levels. For wines of quality 7 and 8, the alcohol level is close to the median alcohol level of 10.20.

# Let's try using a facet wrap of a different variable, such as residual sugar
badplot <- ggplot(data = redInfo, aes(x = density, y = total.sulfur.dioxide)) +
  facet_wrap(~residual.sugar)
# Obviously that's not going to work, because there are too many unique residual sugar values

# Let's try again, starting from the four quartiles of the residual sugar values
residualSugarSummary <- summary(redInfo$residual.sugar)
residualSugarSummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
# Treat quality as a categorical variable so each level gets its own box
redInfo$quality <- factor(redInfo$quality,
                          labels = c(3, 4, 5, 6, 7, 8))
ggplot(redInfo, aes(x = quality, y = residual.sugar)) +
  labs(title = "Quality VS Residual Sugar",
       x = "Quality") +
  geom_boxplot(fill = 'purple', colour = 'orange', alpha = 0.7) +
  # Zoom in on the 2 to 4 range without dropping any rows
  coord_cartesian(ylim = c(2, 4)) +
  scale_y_continuous(name = 'residual sugar', breaks = seq(2, 4, .25))

# Interesting to note that there are very few outliers below Q1
# Also, after adjusting the breaks, it is easy to see that wines of a higher quality on average have a residual sugar between 2.5 and 3
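
The quartile grouping mentioned in the comments above could be done with `cut()` and `quantile()`. This is a sketch on made-up values, not the author's code; `sugar_demo` and `sugar_group` are hypothetical names.

```r
# Made-up residual sugar values standing in for redInfo$residual.sugar
sugar_demo <- c(0.9, 1.6, 1.9, 2.0, 2.2, 2.4, 2.6, 3.8, 15.5)

# Quartile boundaries: min, Q1, median, Q3, max
breaks <- quantile(sugar_demo)

# include.lowest = TRUE keeps the minimum value in the first bin
sugar_group <- cut(sugar_demo, breaks = breaks,
                   labels = c("Q1", "Q2", "Q3", "Q4"),
                   include.lowest = TRUE)
table(sugar_group)
```

A factor like this has only four levels, so faceting on it stays readable, unlike faceting on the raw values.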


# Let's try total sulfur dioxide vs quality in a similar boxplot
# Let's add a y-axis limit as well to hide the outliers
ggplot(redInfo, aes(x = quality, y = total.sulfur.dioxide)) +
  geom_boxplot(fill = 'purple', colour = 'yellow', alpha = 0.7) +
  coord_cartesian(ylim = c(0, 100)) +
  labs(title = "Quality VS Total Sulfur Dioxide",
       x = "Quality") +
  scale_y_continuous(name = 'total sulfur dioxide', breaks = seq(0, 100, 25))

Based on the above graphs, it seems that residual sugar alone does not play a significant part in determining the quality of the wine. Analyzing total sulfur dioxide also shows that there isn’t a significant trend between quality and total sulfur dioxide alone.
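
A rank correlation is one quick numeric check on a claim like this. The sketch below uses made-up data (`demo` is hypothetical), so the number it prints says nothing about the real wines. Because `quality` was converted to a factor earlier, it is converted back to numbers through `as.character()` first.

```r
# Made-up data standing in for redInfo
demo <- data.frame(
  quality = factor(c(3, 4, 5, 5, 6, 6, 7, 8)),
  residual.sugar = c(2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.5, 2.3)
)

# as.numeric() on a factor returns level codes (1, 2, ...), so go through
# as.character() to recover the original quality scores
quality_num <- as.numeric(as.character(demo$quality))

# Spearman's rho works on ranks, which suits an ordinal score like quality
rho <- cor(quality_num, demo$residual.sugar, method = "spearman")
rho
```

A value near zero on the real data would agree with the reading of the boxplots.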

# Want to compare alcohol levels while paying attention to the quality
# (chlorides are not plotted here, despite the variable name)

chloridesAlcoholQuality <- ggplot(data = redInfo,
                                  aes(x = quality, y = alcohol)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .5, color = 'blue') +
  # fun replaces fun.y, which is deprecated as of ggplot2 3.3.0
  stat_summary(fun = "mean",
               geom = "point",
               shape = 8,
               size = 4) +
  labs(title = "Alcohol VS Quality",
       x = "Quality",
       y = "Alcohol Level")

chloridesAlcoholQuality

There is a clear trend that wines of a higher quality have a higher amount of alcohol. Also, this graph shows that most wines have a quality of 5 or 6.
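
Both observations can be backed with numbers: the mean alcohol per quality level (the star marker in the plot) and the count of wines per level. A minimal sketch on made-up data, where `demo` is a hypothetical stand-in for `redInfo`:

```r
# Made-up data standing in for redInfo
demo <- data.frame(
  quality = factor(c(3, 5, 5, 5, 6, 6, 6, 7, 8)),
  alcohol = c(9.0, 9.5, 9.8, 10.0, 10.4, 10.6, 10.8, 11.5, 12.0)
)

# Mean alcohol within each quality level
mean_alcohol <- tapply(demo$alcohol, demo$quality, mean)

# Number of wines at each quality level
counts <- table(demo$quality)

mean_alcohol
counts
```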

## [1] 3 8
## [1] 2.74 4.01

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

FINAL PLOTS

PLOT #1

## [1] 3 8
## [1] 2.74 4.01

##PLOT 1 ANALYSIS

The console shows the range of both the quality values and the pH values for the dataset.

The pH levels of wines have a negative relationship with quality: as the quality improves, the pH level decreases. The pH of wines of quality 3 and 4 is approximately 3.4, while wines of the highest quality have a median pH of 3.25.

PLOT #2

##PLOT 2 ANALYSIS

The 2nd graph illustrates the citric acid level vs the residual sugar level of wines of varying qualities. The lowest quality wines mostly have little to no citric acid and a residual sugar level around 2. The average quality wines have a higher level of alcohol and a wider range of residual sugar levels. For wines of quality 7 and 8, the alcohol level is close to the median alcohol level of 10.20.

PLOT #3

## [1]   6 289
## [1]   6 165
## [1]   7 160
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   20.00   38.00   48.66   66.00  289.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   21.00   38.00   49.57   69.00  165.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   24.00   38.00   43.68   58.00  160.00

##PLOT 3 ANALYSIS

The first three lines of output show the range of total sulfur dioxide for the three subsets (low pH, medium pH, and large pH), and the summaries that follow show their quartiles. It’s interesting to note that the medium pH wines and the low pH wines have similar Q2 and Q3 levels, whereas the high pH wines have a lower mean and Q3 value than both.
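
Output of this shape can be produced in one pass by splitting the variable by group. In the sketch below, `demo` and the `ph_group` column are hypothetical; each group holds only three made-up rows, set to the minimum, median, and maximum printed above, so it does not reproduce the real quartiles.

```r
# Three made-up rows per group: the printed min, median, and max
demo <- data.frame(
  ph_group = rep(c("low", "medium", "high"), each = 3),
  total.sulfur.dioxide = c(6, 38, 289,  6, 38, 165,  7, 38, 160)
)

# One numeric vector per pH group
by_group <- split(demo$total.sulfur.dioxide, demo$ph_group)

# range() returns c(min, max); sapply() puts one column per group
ranges  <- sapply(by_group, range)
medians <- sapply(by_group, median)

ranges
medians
```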

The first graph has a higher range for citric acid, which is why I set the x-axis limit a little higher for that graph. For wines with low pH, a significant number have citric acid levels between 0.4 and 0.6 with total sulfur dioxide levels under 50. The second graph demonstrates that medium pH wines can have a wide range of both total sulfur dioxide and citric acid, as is evident from the absence of dark points under the alpha parameter. The third graph shows that many wines with a high pH have little to no citric acid and a total sulfur dioxide level under 60. The high pH graph also has almost no slope in comparison to the other two graphs. This shows that when the pH is high, the total sulfur dioxide and citric acid levels don’t fluctuate much when they are both trending in the same direction.
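
The slopes compared above can be estimated with a simple linear fit per pH subset. This is a sketch on made-up numbers, not a result from the dataset; `ph_demo` stands in for one of the subsets.

```r
# Made-up subset standing in for one of the pH groups
ph_demo <- data.frame(
  citric.acid = c(0.0, 0.1, 0.2, 0.3, 0.4),
  total.sulfur.dioxide = c(20, 26, 32, 38, 44)
)

# Least-squares line: total sulfur dioxide as a function of citric acid
fit <- lm(total.sulfur.dioxide ~ citric.acid, data = ph_demo)

# The coefficient on citric.acid is the slope of that line
slope <- unname(coef(fit)["citric.acid"])
slope
```

Running the same fit on each of the three subsets would give three slopes to compare directly, rather than judging them by eye.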

CONCLUSION

After analyzing the dataset of 1599 wines based on the 13 variables provided, I came to some conclusions. To start, the quality of the wine is heavily impacted by the pH level of the wine. Wines of lower quality tend to have a higher pH than wines of higher quality. Also, wines of higher quality tend to have neither too much nor too little alcohol, and they did not have too much residual sugar. Wines that are very poor tend to have a very small amount of alcohol. Wines that have a larger amount of alcohol are most often average. I also observed a positive relationship involving pH, citric acid, and total sulfur dioxide. The wines that had a high pH had a citric acid value of zero or close to zero. The citric acid values for low pH wines were around 0.5, while those for medium pH wines were less distinct and had a much wider range, on average.

In conclusion, I learned that wines of higher quality tend to have a good balance of significant factors such as pH, residual sugar, and alcohol levels. Most often, when wines had a significantly high or low value for these specific factors, the quality of the wine was not very high.

I wish there had been more variables covering factors such as price or sales. I think it would have made the project more interesting to examine the different wines at different price points, especially in comparison to the quality of the wines. Also, I initially had some issues trying to categorize the data with some of the variables, considering their ranges were very large. The quality variable was perfect in the sense that its range was only between three and eight. I overcame this by making other categorical variables such as “lowPH”, “mediumPH”, and “largePH” that I could use to look at different aspects of the data as well.